Inference for a Large Directed Acyclic Graph with Unspecified Interventions

Statistical inference of directed relations given some unspecified interventions (i.e., the intervention targets are unknown) is challenging. In this article, we test hypothesized directed relations with unspecified interventions. First, we derive conditions to yield an identifiable model. Unlike classical inference, testing directed relations requires identifying the ancestors and relevant interventions of hypothesis-specific primary variables. To this end, we propose a peeling algorithm based on nodewise regressions to establish a topological order of primary variables. Moreover, we prove that the peeling algorithm yields a consistent estimator in low-order polynomial time. Second, we propose a likelihood ratio test integrated with a data perturbation scheme to account for the uncertainty of identifying ancestors and interventions. We also show that the distribution of the data perturbation test statistic converges to the target distribution. Numerical examples demonstrate the utility and effectiveness of the proposed methods, including an application to inferring gene regulatory networks. The R implementation is available at https://github.com/chunlinli/intdag.


Introduction
Directed relations are essential to explaining pairwise dependencies among multiple interacting units. In gene network analysis, regulatory gene-to-gene relations are a focus of biological investigation (Sachs et al., 2005), while in a human brain network, scientists investigate causal influences among regions of interest to understand how the brain functions through the regional effects of neurons (Liu et al., 2017). In such a situation, a Gaussian directed acyclic graph (DAG) is commonly employed to describe the directed relations; however, inferring the directed effects without other information is generally impossible because a Gaussian DAG often lacks model identifiability (Van de Geer and Bühlmann, 2013). Hence, external interventions are introduced to treat a non-identifiable situation (Heinze-Deml et al., 2018). For instance, genetic variants such as single-nucleotide polymorphisms (SNPs) can be, and indeed are increasingly, treated as external interventions to infer inter-trait causal relations in a quantitative trait network (Brown and Knowles, 2020) and gene interactions in a gene regulatory network (Teumer, 2018; Molstad et al., 2021). In neuroimaging analysis, scientists use randomized experimental stimuli as interventions to identify causal relations in a functional brain network (Grosse-Wentrup et al., 2016; Bergmann and Hartwigsen, 2021). However, the interventions in these studies often have unknown targets and off-target effects (Jackson et al., 2003; Eaton and Murphy, 2007). Consequently, inferring directed relations while identifying useful interventions for inference is critical.

This paper focuses on the simultaneous inference of directed relations subject to unspecified interventions (i.e., the intervention targets are unknown). In a DAG model, research has centered on the reconstruction of directed relations in observational and interventional studies (Van de Geer and Bühlmann, 2013; Oates et al., 2016; Zheng et al., 2018; Yuan et al., 2019); see Heinze-Deml et al. (2018) for a review. For inference, Bayesian methods (Friedman and Koller, 2003; Luo and Zhao, 2011; Viinikka et al., 2020) have been popular. Yet, statistical inference remains under-studied, especially for interventional models in high dimensions (Peters et al., 2016; Rothenhäusler et al., 2019). Recently, for observational data, Janková and van de Geer (2018) propose a debiased test of the strength of a single directed relation, and Li et al. (2020) derive a constrained likelihood ratio test for multiple directed relations.
Despite progress, challenges remain. First, inferring directed relations requires identifying a certain DAG topological order (Van de Geer and Bühlmann, 2013), while identifiability in a Gaussian DAG with unspecified interventions remains under-explored. Second, the inferential results should agree with the acyclicity requirement for a DAG. As a result, degenerate and intractable situations can occur, making inference greatly different from classical settings. Third, likelihood-based methods for learning the DAG topological order often use permutation search (Van de Geer and Bühlmann, 2013) or continuous optimization subject to the acyclicity constraint (Zheng et al., 2018; Yuan et al., 2019), where a theoretical guarantee for the actual estimate (instead of the global optimum) has not been established. Recently, an important line of work (Ghoshal and Honorio, 2018; Rajendran et al., 2021; Rolland et al., 2022) has focused on order-based algorithms with computational and statistical guarantees. However, in Gaussian DAGs, existing methods often rely on error scale assumptions (Peters and Bühlmann, 2014), which are sensitive to variable scaling, such as the common practice of standardizing variables. This drawback could limit their applications, especially in causal inference, as causal relations are typically invariant to scaling.
To address the above issues, we develop structure learning and inference methods for a Gaussian DAG with unspecified additive interventions. Unlike existing approaches, the proposed framework integrates structure learning with inference. With suitable interventions called instrumental variables (IVs), the proposed approach removes the restrictive error scale assumptions and delivers credible outcomes with theoretical guarantees in low-order polynomial time. This indicates that IVs, a well-known tool in causal inference (Angrist et al., 1996), can play an important role in structure learning even if some interventions do not meet the IV criteria. Our contributions are summarized as follows.

•
For modeling, we establish the identifiability conditions for a Gaussian DAG with unspecified interventions. In particular, the conditions allow interventions on more than one target, which is suitable for multivariate causal analysis (Murray, 2006).

•
For methodology, we develop likelihood ratio tests for directed edges and pathways in a super-graph of the true DAG, called the ancestral relation graph (ARG), where the ARG is formed by ancestral relations and candidate interventional relations, offering the topological order for inference. We reconstruct the ARG by the peeling algorithm, which automatically meets the acyclicity requirement. On this basis, we introduce the concepts of nondegeneracy and regularity to characterize the behavior of hypothesis testing under a DAG model. By integrating structure learning with inference, we account for the uncertainty of ARG estimation for the proposed tests via a novel data perturbation (DP) scheme, which effectively controls the type-I error while enjoying high statistical power.
• For theory, we prove that the proposed peeling algorithm based on nodewise regressions yields consistent results in $O(p \times \log\kappa_{\max}^{\circ} \times (q^3 + nq^2))$ operations almost surely, where p, q are the numbers of primary and intervention variables, n is the sample size, and $\kappa_{\max}^{\circ}$ is the sparsity. Then we justify the proposed DP inference method by establishing the convergence of the DP likelihood ratio to the target distribution and desired power properties.

•
The numerical studies and real data analysis demonstrate the utility and effectiveness of the proposed methods. The implementation of the proposed tests and structure learning method is available at https://github.com/chunlinli/intdag.
The rest of the article is structured as follows. Section 2 establishes model identifiability and states two inference problems of interest. Section 3 develops the proposed methods for structure learning and statistical inference. Section 4 presents statistical theory to justify the proposed methods. Section 5 performs simulation studies, followed by an application to infer gene pathways from gene expression and SNP data in Section 6. Section 7 concludes the article. The Appendix contains illustrative examples and technical proofs.

Gaussian directed acyclic graph with additive interventions
To infer directed relations among p primary variables (i.e., variables of primary interest) $Y = (Y_1, \ldots, Y_p)^\top$ from q intervention variables $X = (X_1, \ldots, X_q)^\top$, we consider the Gaussian DAG model with additive interventions,

$$Y = U^\top Y + W^\top X + \varepsilon, \qquad \varepsilon \sim N(0, \Sigma), \quad \Sigma = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_p^2), \tag{1}$$

where $U_{kj} \neq 0$ encodes a directed edge $Y_k \to Y_j$ among the primary variables and $W_{lj} \neq 0$ encodes an interventional effect $X_l \to Y_j$. In Definition 1, (A) requires an intervention to be active, while (B) prevents simultaneous interventions of a single intervention variable on multiple primary variables. This is critical to identifiability because an instrument on a (potential) cause variable $Y_1$ helps reveal its directed effect on an outcome variable $Y_2$, which breaks the symmetry in a Gaussian DAG that otherwise renders the directed relations $Y_1 \to Y_2$ and $Y_2 \to Y_1$ non-identifiable.

Remark 1
In the literature (Angrist et al., 1996), an instrument X for estimating the effect from a potential cause Y 1 to the outcome Y 2 is required to satisfy (i) X is related to Y 1 , called relevance, (ii) X has no directed edge to the outcome Y 2 , called exclusion, and (iii) X is not related to unmeasured confounders, called unconfoundedness. In Definition 1, (A) is the relevance property, (B) generalizes the exclusion property for a DAG model, and the unconfoundedness is satisfied because no confounder is present in model (1).
Next, we make some assumptions on intervention variables to yield an identifiable model, where dependencies among intervention variables are permissible.
Assumption 1 Assume that model (1) satisfies the following conditions.
(1B) $\mathrm{Cov}(Y_j, X_l \mid X_{\{1, \ldots, q\} \setminus \{l\}}) \neq 0$ if $X_l$ intervenes on any unmediated parent of $Y_j$.
(1C) Each primary variable is intervened on by at least one instrument.
Assumption 1A imposes mild distributional restrictions on X, permitting discrete variables such as SNPs. Assumption 1B requires the interventional effects through unmediated parents not to cancel out, as multiple targets from an invalid instrument are permitted. Importantly, if either Assumption 1B or 1C fails, model (1) is generally not identifiable, as shown in Example 2 of Appendix A.1. In Section 5, we empirically examine the situation when Assumption 1C is not met.
Proposition 1 (proved in Appendix B.1) shows that Assumption 1 suffices for the identifiability of model (1) with unspecified interventions. This is in contrast to Proposition 1 of Chen et al. (2018), which proves the identifiability of the parameters in a directed graph with target-known instruments on each primary variable. Moreover, it is worth noting that the estimated graph in Chen et al. (2018) may be cyclic and lacks the local Markov property for causal interpretation (Spirtes et al., 2000).

Problem statement: inference for a DAG
Our goal is to perform statistical inference of directed edges and pathways in the DAG G. Let $\mathcal{H} \subseteq \{(k, j): k \neq j,\ 1 \leq k, j \leq p\}$ be an edge set among primary variables $\{Y_1, \ldots, Y_p\}$, where $(k, j) \in \mathcal{H}$ specifies a (hypothesized) directed edge $Y_k \to Y_j$ in (1). We shall focus on two types of testing with null and alternative hypotheses $H_0$ and $H_a$. For simultaneous testing of directed edges,

$$H_0: U_{kj} = 0 \text{ for all } (k, j) \in \mathcal{H} \quad \text{versus} \quad H_a: U_{kj} \neq 0 \text{ for some } (k, j) \in \mathcal{H}; \tag{2}$$

for simultaneous testing of directed pathways,

$$H_0: U_{kj} = 0 \text{ for some } (k, j) \in \mathcal{H} \quad \text{versus} \quad H_a: U_{kj} \neq 0 \text{ for all } (k, j) \in \mathcal{H}, \tag{3}$$

where $U_{\mathcal{H}^c}$, W, Σ are unspecified nuisance parameters and c denotes the set complement. Note that $H_0$ in (3) is a composite hypothesis that can be decomposed into sub-hypotheses

$$H_{0,\nu}: U_{k_\nu, j_\nu} = 0 \quad \text{versus} \quad H_{a,\nu}: U_{k_\nu, j_\nu} \neq 0; \qquad \nu = 1, \ldots, |\mathcal{H}|,$$

where $\mathcal{H} = \{(k_1, j_1), \ldots, (k_{|\mathcal{H}|}, j_{|\mathcal{H}|})\}$ and testing each sub-hypothesis is a directed edge test. Thus, we treat (3) as an extension of (2).
We will also estimate (U, W) as well as identify the nonzero elements in U to recover the directed edges among the primary variables Y in G.

Methodology
This section develops the main methodology, including the peeling algorithm for structure learning and the data perturbation inference for simultaneous testing of directed edges (2) and pathways (3). To proceed, suppose the data matrices $Y_{n \times p} = (Y_1, \ldots, Y_n)^\top$ and $X_{n \times q} = (X_1, \ldots, X_n)^\top$ are given, where the rows $\{(Y_i^\top, X_i^\top)\}_{1 \leq i \leq n}$ are independently sampled from (1). Then the log-likelihood is (up to a constant)

$$\ell(\theta, \Sigma) = -\frac{1}{2} \sum_{j=1}^p \Big( n \log \sigma_j^2 + \sigma_j^{-2} \|Y_{\cdot j} - Y U_{\cdot j} - X W_{\cdot j}\|_2^2 \Big), \tag{4}$$

where $\theta = (U, W)$, $\Sigma = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_p^2)$, and U is subject to the acyclicity constraint (Zheng et al., 2018; Yuan et al., 2019) in that no directed cycle is permissible in the DAG.
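For reference, the acyclicity of U admits the smooth characterization of Zheng et al. (2018), stated here only to make the constraint concrete (with $\circ$ the elementwise product):

$$h(U) = \operatorname{tr}\big(e^{U \circ U}\big) - p = 0 \quad \Longleftrightarrow \quad U \text{ induces no directed cycle.}$$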
One major challenge to this likelihood approach lies in the optimization of (4) subject to the acyclicity constraint, which imposes difficulties on both computation and asymptotic theory. As a result, there is a gap between the asymptotic distribution of a global maximum and that of the actual estimate, which can be a local maximum (Janková and van de Geer, 2018; Li et al., 2020). Moreover, the actual estimate may give an imprecise topological order, tending to adversely impact inference.
To circumvent the acyclicity requirement, we propose to use the ancestral relation graph (ARG) to describe the topological order of the DAG, where the ARG can be efficiently estimated without explicitly imposing the acyclicity constraint while enjoying a statistical guarantee of the actual estimate.
Our plan is as follows. In Section 3.1, we construct G + without the acyclicity constraint for U. On this basis, in Section 3.2 we develop likelihood ratio tests for (2) and (3).

Structure learning via peeling
This section develops a novel structure learning method to construct $\widehat{G}_+$ in a hierarchical manner. First, we observe an important connection between primary variables and intervention variables. Rewrite (1) as

$$Y = V^\top X + \tilde{\varepsilon}, \qquad \tilde{\varepsilon} \sim N(0, \Omega^{-1}), \tag{6}$$

where $\Omega = (I - U)\Sigma^{-1}(I - U^\top)$ and $V = W(I - U)^{-1}$.
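The reduced form (6) follows from a one-line rearrangement of (1), which we record here for completeness:

$$Y = U^\top Y + W^\top X + \varepsilon \;\Longrightarrow\; Y = (I - U^\top)^{-1}(W^\top X + \varepsilon) = V^\top X + \tilde{\varepsilon},$$

with $V = W(I - U)^{-1}$ and $\operatorname{Var}(\tilde{\varepsilon}) = (I - U^\top)^{-1}\Sigma(I - U)^{-1} = \Omega^{-1}$.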

Proposition 2 Suppose Assumption 1 is satisfied.

A.
If $V_{lj} \neq 0$, then $X_l$ intervenes on $Y_j$ or an ancestor of $Y_j$;

B.
In G, Y j is a leaf variable (having no child) if and only if there is an instrument X l such that V lj ≠ 0 and V lj′ = 0 for j′ ≠ j.
The proof of Proposition 2 is deferred to Appendix B.2. Intuitively, $V_{lj} \neq 0$ implies the dependence of $Y_j$ on $X_l$ through a directed path $X_l \to \cdots \to Y_j$, and hence that $X_l$ intervenes on $Y_j$ or an ancestor of $Y_j$. Thus, the instruments on a leaf variable are independent of the other primary variables conditional on the rest of the interventions. This observation suggests a method to reconstruct the DAG topological order by recursively identifying and removing (i.e., peeling) the leaf variables.
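As a toy illustration of Proposition 2 (a sketch we add for intuition, not code from the intdag package), consider a chain with one instrument per variable; the support pattern of V directly exposes the leaf:

```r
# Chain DAG Y1 -> Y2 -> Y3 with instruments X1 -> Y1, X2 -> Y2, X3 -> Y3.
p <- 3
U <- matrix(0, p, p)
U[1, 2] <- 1; U[2, 3] <- 1     # U[k, j] != 0 encodes the edge Y_k -> Y_j
W <- diag(p)                   # W[l, j] != 0 encodes the intervention X_l -> Y_j
V <- W %*% solve(diag(p) - U)  # reduced-form matrix V = W (I - U)^{-1}
print(V != 0)
# Row 1 hits columns 1, 2, 3: X1 affects Y1 and all of its descendants.
# Row 2 hits columns 2, 3.
# Row 3 hits column 3 only: by Proposition 2B, Y3 is a leaf with instrument X3.
```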
Next, we discuss the estimation of V and construction of G + .

NODEWISE CONSTRAINED REGRESSIONS. We estimate each column $V_{\cdot j}$ of V by the $\ell_0$-constrained least squares regression of $Y_{\cdot j}$ on X,

$$\widehat{V}_{\cdot j} = \operatorname*{argmin}_{V_{\cdot j} \in \mathbb{R}^q} \frac{1}{2n} \|Y_{\cdot j} - X V_{\cdot j}\|_2^2 \quad \text{subject to} \quad \sum_{l=1}^q I(V_{lj} \neq 0) \leq \kappa_j, \tag{7}$$

where $1 \leq \kappa_j \leq q$ is an integer-valued tuning parameter controlling the sparsity and can be chosen by BIC or cross-validation. To solve (7), we use $J(z; \tau_j) = \min(|z|/\tau_j, 1)$ as a surrogate of $I(z \neq 0)$ (Shen et al., 2012) and develop a difference-of-convex (DC) program with the $\ell_0$-projection to improve the globality of the solution of (7). Specifically, at the (t + 1)th iteration, given $V_{\cdot j}^{[t]}$, we solve the weighted Lasso problem (8), with weights determined by $V_{\cdot j}^{[t]}$ and $\tau_j$, where $\gamma_j > 0$ is an internal hyperparameter used by the DC program; see Remark 2 below.
The DC program terminates when $\|V_{\cdot j}^{[t+1]} - V_{\cdot j}^{[t]}\|_\infty \leq \mathrm{tol}$ or t reaches the maximum iteration number, where tol is the machine precision. Then, the solution $\widehat{V}_{\cdot j}$ of (7) is computed by projecting the DC solution onto the set $\{v \in \mathbb{R}^q : \|v\|_0 \leq \kappa_j\}$.
Algorithm 1 summarizes the computation method.
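To make the DC iteration concrete, the following is a minimal R sketch, with glmnet standing in for the weighted Lasso solver; the function name nodewise_tlp, the penalty scaling gamma * tau, and the exact weighting scheme are our illustrative assumptions rather than the intdag implementation (glmnet also rescales penalty factors internally, which a production implementation would handle explicitly).

```r
library(glmnet)

# Sketch of Algorithm 1 for one response column: DC iterations with the TLP
# surrogate J(z; tau) = min(|z|/tau, 1), followed by an l0-projection.
nodewise_tlp <- function(X, y, kappa, tau, gamma, max_iter = 20, tol = 1e-10) {
  q <- ncol(X)
  v <- rep(0, q)                          # initial value V^{[0]}
  for (t in seq_len(max_iter)) {
    # DC decomposition of the TLP: coordinates with |v_l| > tau are treated
    # as "large" and left unpenalized; the rest receive the Lasso penalty.
    w <- as.numeric(abs(v) <= tau)
    if (all(w == 0)) break                # penalty inactive; keep current v
    fit <- glmnet(X, y, lambda = gamma * tau, penalty.factor = w,
                  intercept = FALSE, standardize = FALSE)
    v_new <- as.numeric(coef(fit))[-1]    # drop the intercept entry
    if (max(abs(v_new - v)) <= tol) { v <- v_new; break }
    v <- v_new
  }
  # l0-projection onto {v: ||v||_0 <= kappa}: keep the kappa largest entries.
  keep <- order(abs(v), decreasing = TRUE)[seq_len(kappa)]
  out <- numeric(q)
  out[keep] <- v[keep]
  out
}
```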
Remark 2 In Algorithm 1, $\gamma_j$ is chosen from a set of candidate values $\{\gamma^{(r)}\}_{1 \leq r \leq R}$. In our implementation, $\gamma_j$ is not directly tuned by the user, and $\{\gamma^{(r)}\}_{1 \leq r \leq R}$ is provided by default. When $\{(\kappa_j, \tau_j)\}_{1 \leq j \leq p}$ are suitably specified by the user, for any value $\gamma^{(r)}$ lying in the proper range, $\widehat{V}_{\cdot 1}, \ldots, \widehat{V}_{\cdot p}$ are the global solutions of (7) almost surely; see Theorem 4 in Section 4.1. Moreover, solving the DC programs for $\gamma^{(1)} > \cdots > \gamma^{(R)}$ is efficient with the warm start trick (Friedman et al., 2010; Breheny and Huang, 2011).

PEELING. Now, we describe a peeling algorithm to estimate $\widehat{G}_+$ based on $\widehat{V}$.
Proposition 2 suggests that the leaf variables of G (with their instruments) can be identified based on the matrix V. To proceed, let ℒ and its complement $\mathcal{L}^c$ be (generic) nonempty subsets of {1, …, p} such that $Y_{\mathcal{L}^c}$ are non-descendants of $Y_{\mathcal{L}}$. Define a sub-DAG $G_{\mathcal{L}^c}$ of G restricted to $Y_{\mathcal{L}^c}$ and the associated intervention variables.

Proposition 3 Suppose Assumption 1 is satisfied. Let $Y_k$ be a leaf in $G_{\mathcal{L}^c}$ and $Y_j$ be in $Y_{\mathcal{L}}$.

A.
If V lj ≠ 0 for each instrument X l of Y k in G ℒ c , we have (k, j) ∈ ℰ + .

B.
If $Y_k$ is an unmediated parent of $Y_j$, then $V_{lj} \neq 0$ for each instrument $X_l$ of $Y_k$ in $G_{\mathcal{L}^c}$.
Proposition 3 (proved in Appendix B.3) together with Proposition 2 indicates that $G_+$ can be constructed from V. In particular, we can sequentially identify each leaf $Y_k$ with its instrument(s) $X_l$ in the DAG G or the sub-DAG $G_{\mathcal{L}^c}$ (where $Y_{\mathcal{L}}$ are peeled variables). Then $\mathcal{E}_+$ can be constructed by including all edges (k, j) such that $Y_k$ is a leaf in the sub-DAG $G_{\mathcal{L}^c}$, $Y_j$ is a peeled variable, and $V_{lj} \neq 0$ for each instrument $X_l$ of the leaf $Y_k$ in $G_{\mathcal{L}^c}$. By Proposition 3, (A) confirms that all such edges are in $\mathcal{E}_+$ so no extra edges are included, and (B) guarantees that every directed edge from an unmediated parent is included, which is sufficient to determine all the ancestral relations. Thus, $\mathcal{E}_+$ can be recovered from V. Then $\mathcal{I}_+$ is given by $\{(l, j): V_{lk} \neq 0 \text{ for some } k \text{ with } k = j \text{ or } (j, k) \in \mathcal{E}_+\}$.
The peeling algorithm is summarized in Algorithm 2 and a detailed illustration is presented in Example 3 of Appendix A.2.
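As a simplified sketch of the peeling loop (illustrative code under the stated assumptions, not the intdag implementation; it returns only a topological order and omits the bookkeeping of Steps 9-10 for $\mathcal{E}_+$ and $\mathcal{I}_+$):

```r
# Peel leaves from the support of V-hat: an instrument row whose support
# (restricted to unpeeled variables) is a single column identifies a leaf
# (Proposition 2B); remove the leaf and repeat.
peel_order <- function(Vhat, tol = 1e-8) {
  S <- abs(Vhat) > tol                    # q x p support pattern of V-hat
  remaining <- seq_len(ncol(S))           # unpeeled primary variables
  order_out <- integer(0)                 # leaves first, root(s) last
  while (length(remaining) > 0) {
    hits <- rowSums(S[, remaining, drop = FALSE])
    l <- which(hits == 1)[1]              # an instrument row hitting one leaf
    if (is.na(l)) stop("no leaf found: Assumption 1 may be violated")
    j <- remaining[which(S[l, remaining])]
    order_out <- c(order_out, j)
    remaining <- setdiff(remaining, j)
  }
  order_out
}
```

Continuing the chain example above, peel_order(V) returns c(3, 2, 1), i.e., the leaf-to-root order Y3, Y2, Y1.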

Likelihood inference for a DAG
Now, we propose an inference method for testing (2) and (3). First, we derive the likelihood ratio based on G + . Next, we perform tests via data perturbation, accounting for the uncertainty of estimating G + .
Thus, we define the likelihood ratio by

$$\mathrm{Lr} = 2\big(\ell(\widehat{\theta}^{(1)}, \widetilde{\Sigma}) - \ell(\widehat{\theta}^{(0)}, \widetilde{\Sigma})\big), \tag{10}$$

where $\widehat{\theta}^{(0)}$ and $\widehat{\theta}^{(1)}$ are the MLEs (given $\widehat{\mathcal{M}}$) under $H_0$ and $H_a$, respectively, $\widehat{\mathcal{M}} = (Y, X; \widehat{\mathcal{E}}_+ \cup \mathcal{H}, \widehat{\mathcal{I}}_+)$ is an estimate for $G_+$, and $\widetilde{\Sigma}$ is an estimate for Σ. Instead of $\widehat{G}_+ = (Y, X; \widehat{\mathcal{E}}_+, \widehat{\mathcal{I}}_+)$, the graph $\widehat{\mathcal{M}}$ (with the hypothesized edges added) is used because we need to test the presence of any edge in ℋ.
In many statistical models, the likelihood ratio has a nondegenerate and tractable null distribution, for instance, a chi-squared distribution with $|\mathcal{H}|$ degrees of freedom. However, since ℋ is pre-specified by the user, $\widehat{\mathcal{M}}$ may not be a DAG, and thus not all edges in ℋ can be present in the DAG parameterized by $\widehat{\theta}^{(1)}$. As a result, Lr for (2) may converge to a distribution with degrees of freedom less than $|\mathcal{H}|$, and the distribution may even be intractable, making inference for a DAG greatly different from classical settings, as illustrated by Example 1.
Example 1 Consider the likelihood ratio test under null $H_0$ and alternative $H_a$ for the DAG G displayed in Figure 1. For simplicity, assume $\widehat{G}_+ = G_+$ and $\widehat{\mathcal{M}} = (Y, X; \mathcal{E}_+ \cup \mathcal{H}, \mathcal{I}_+)$.
• $H_0: U_{21} = 0$ versus $H_a: U_{21} \neq 0$, where ℋ = {(2, 1)}. Here, (2, 1) forms a cycle together with the edges in ℰ ∖ ℋ (namely the edges not considered by the hypothesis), and thus $\widehat{\mathcal{M}}$ has a directed cycle. Given $\widehat{\mathcal{M}}$, when a random sample is obtained under $H_0$, the likelihood tends to be maximized under the ARG $G_+$ corresponding to the underlying DAG (which implies $\widehat{U}_{21} = 0$), especially so as the asymptotics kick in with increasing sample size. Consequently, the likelihood ratio Lr becomes zero, constituting a degenerate situation.
Motivated by Example 1, we introduce the concepts of nondegeneracy and regularity.

Definition 3 (Nondegeneracy and regularity with respect to DAG)
A.
A hypothesized edge (k, j) ∈ ℋ is said to be nondegenerate with respect to DAG G if the edge can be added to the ancestral relations of G without creating a directed cycle, that is, $\{(k, j)\} \cup \mathcal{E}_+$ contains no directed cycle, where $\mathcal{E}_+$ is the edge set of the ARG $G_+$; otherwise, the edge is called degenerate. Let D ⊆ ℋ denote the set of nondegenerate hypothesized edges.

B.
A null hypothesis $H_0$ is said to be regular with respect to DAG G if D ∪ ℰ contains no directed cycle, where ℰ denotes the edge set of G. Otherwise, $H_0$ is called irregular.

Remark 4
In practice, D is unknown and needs to be estimated from data. Indeed, $\widehat{D}$ can be computed from the estimated ARG $\widehat{G}_+$, because a directed edge (k, j) is nondegenerate if and only if $\{(k, j)\} \cup \widehat{\mathcal{E}}_+$ contains no directed cycle.
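Because ancestral relations are transitive, the estimated ancestral edge set is transitively closed, and the cycle check in Remark 4 reduces to a single lookup; a small sketch (the helper name is hypothetical, not the package API):

```r
# Adding the hypothesized edge (k, j) to the ancestral relations creates a
# directed cycle iff Y_j is already an ancestor of Y_k, i.e., (j, k) lies in
# the (transitively closed) estimated ancestral edge set.
is_nondegenerate <- function(arg_edges, k, j) {
  # arg_edges: 2-column matrix of ancestral relations (ancestor, descendant)
  !any(arg_edges[, 1] == j & arg_edges[, 2] == k)
}
```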
Nondegeneracy ensures nonnegativity of the likelihood ratio. In testing (2), regularity excludes intractable situations for the null distribution. In testing (3), if H 0 is irregular, then D ∪ ℰ has a directed cycle, which means the hypothesized directed pathway cannot exist due to the acyclicity constraint. Thus, regularity excludes the degenerate situations in testing (3). In what follows, we mainly focus on nondegenerate and regular hypotheses. For the degenerate case, the p-value is defined to be one. For the irregular case of edge test (2), we decompose the hypothesis into regular sub-hypotheses and conduct multiple testing. For the irregular case of pathway test (3), the p-value is defined to be one. More discussions on the implementation in irregular cases are provided in Appendix A.5.

TESTING DIRECTED EDGES VIA DATA PERTURBATION. Assuming $H_0$ is nondegenerate and regular, $\widehat{\theta}^{(1)}$ is the MLE subject to the DAG $\widehat{S} = (Y, X; \widehat{\mathcal{E}}_+ \cup \widehat{D}, \widehat{\mathcal{I}}_+)$ and $\widehat{\theta}^{(0)}$ is the MLE subject to the additional constraint $U_{\mathcal{H}} = 0$. The likelihood ratio (10) can then be further simplified.
Hence, Lr only summarizes the contributions from the primary variables with (estimated) nondegenerate hypothesized edges, where we estimate $\Sigma = \mathrm{Diag}(\sigma_1^2, \ldots, \sigma_p^2)$ by the residual variances $\widetilde{\Sigma} = \mathrm{Diag}(\widetilde{\sigma}_1^2, \ldots, \widetilde{\sigma}_p^2)$ under $\widehat{S}$. The likelihood ratio (11) for testing directed edges (2) requires an estimate $\widehat{G}_+$ (and $\widehat{S}$), where we must account for the uncertainty of $\widehat{G}_+$ (and $\widehat{S}$) for finite-sample inference. To proceed, we consider the test statistic Lr based on a "correct" ARG $\widehat{\mathcal{M}} \supseteq G_+$, where $\widehat{\mathcal{M}} = (Y, X; A, C) \supseteq G_+ = (Y, X; \mathcal{E}_+, \mathcal{I}_+)$ means that $A \supseteq \mathcal{E}_+$ and $C \supseteq \mathcal{I}_+$. Intuitively, a "correct" ARG distinguishes descendants from non-descendants, and thus can help infer the true directed relations defined by the local Markov property (Spirtes et al., 2000) without introducing model errors, yet it may lead to a less powerful test when $\widehat{\mathcal{M}}$ is much larger than $G_+$. By comparison, a "wrong" ARG $\widehat{\mathcal{M}} \not\supseteq G_+$ provides an incorrect topological order, and a test based on it may be biased, accompanied by an inflated type-I error.
Let Z = (Y, X) denote the data matrix of primary and intervention variables and let $e = (\varepsilon_1, \ldots, \varepsilon_n)^\top$ denote the error matrix, where the rows are sampled independently from (1). From (11), assuming $\widehat{G}_+ \supseteq G_+$ is a "correct" ARG, the likelihood ratio becomes a function of Z and e, denoted by $\mathrm{Lr} = \Lambda(Z, e)$ in (13). In (13), we have $Y_{\cdot, D_S(j)} U_{D_S(j), j} = 0$ for all j under the null hypothesis $H_0$, while $Y_{\cdot, D_S(j)} U_{D_S(j), j} \neq 0$ for some j under the alternative hypothesis $H_a$, and thus Lr tends to be large under $H_a$. Now, we propose the data perturbation (DP) method (Shen and Ye, 2002; Breiman, 1992) to approximate the null distribution of Lr in (13). The idea behind DP is to assess the sensitivity of the estimates through perturbed data $Y^* = Y + e^*$, where the rows $\{\varepsilon_i^{*\top}\}_{1 \leq i \leq n}$ of the perturbation error matrix $e^*_{n \times p}$ are sampled independently from $N(0, \widetilde{\Sigma})$. Let $(Z^*, e^*) = (X, Y^*, e^*)$ be the DP sample. Note that the perturbation errors $e^*$ are only injected into Y and are known in the DP sample $(Z^*, e^*)$. Given $(Z^*, e^*)$, we compute the perturbation estimate $\widehat{G}_+^*$ (and $\widehat{S}^*$) by Algorithms 1-2. Under the null hypothesis $H_0$, the likelihood ratio $\mathrm{Lr} = \Lambda(Z, e)$ in (13) is a function of the observed data Z and the unobserved errors e. By construction, the perturbation error $e^*$ is accessible in the DP sample $(Z^*, e^*)$, suggesting the DP likelihood ratio $\mathrm{Lr}^* := \Lambda(Z^*, e^*)$ in (14). Note that (14) mimics (13) when $Y_{\cdot, D_S(j)} U_{D_S(j), j} = 0$. As a result, when $\widehat{G}_+^* \supseteq G_+$, the conditional distribution of $\mathrm{Lr}^*$ given the data Z well approximates the null distribution of Lr, where the model selection effect is accounted for by assessing the variability of $\{A_j^*, B_j^*\}_{1 \leq j \leq p}$ over different realizations of $(Z^*, e^*)$.
In practice, we use Monte Carlo simulation to approximate the distribution of $\mathrm{Lr}^*$ given Z. We generate M perturbed samples $\{(Z_m^*, e_m^*)\}_{1 \leq m \leq M}$ independently and compute the corresponding $\{\mathrm{Lr}_m^*\}_{1 \leq m \leq M}$. Then, we examine the condition $\widehat{G}_{+,m}^* \supseteq G_+$ by checking its empirical counterpart $\widehat{G}_{+,m}^* \supseteq \widehat{G}_+$.
The DP p-value of the edge test in (2) is defined as the proportion of valid perturbed samples on which the DP likelihood ratio exceeds the observed one,

$$p^* = \frac{\sum_{m=1}^M I(\mathrm{Lr}_m^* \geq \mathrm{Lr})\, I(\widehat{G}_{+,m}^* \supseteq \widehat{G}_+)}{\sum_{m=1}^M I(\widehat{G}_{+,m}^* \supseteq \widehat{G}_+)},$$

where I(·) is the indicator function.
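A schematic sketch of the resulting Monte Carlo procedure follows; the helpers peel_arg(), lr_stat(), lr_stat_dp(), and contains() are hypothetical placeholders for Algorithms 1-3 and the subgraph check, not functions of the intdag package.

```r
# DP test for directed edges: perturb Y, re-run structure learning, and
# compare DP likelihood ratios with the observed one on "correct" ARGs.
dp_pvalue <- function(Y, X, H, M = 200) {
  n <- nrow(Y); p <- ncol(Y)
  fit <- peel_arg(Y, X)                    # ARG estimate + sigma^2 estimates
  lr  <- lr_stat(Y, X, fit$arg, H)         # observed likelihood ratio
  valid <- logical(M)
  lr_dp <- numeric(M)
  for (m in seq_len(M)) {
    E <- matrix(rnorm(n * p), n, p) %*% diag(sqrt(fit$sigma2))  # e* draws
    Ystar <- Y + E                         # perturbed responses Y* = Y + e*
    fitm  <- peel_arg(Ystar, X)            # perturbation estimate of the ARG
    valid[m] <- contains(fitm$arg, fit$arg)            # ARG* contains ARG-hat?
    lr_dp[m] <- lr_stat_dp(Ystar, X, E, fitm$arg, H)   # DP likelihood ratio
  }
  sum(valid & (lr_dp >= lr)) / max(sum(valid), 1)      # DP p-value
}
```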
Remark 5 Instead of (14), a naive approach is to recompute the likelihood ratio by treating the perturbed sample $Z^*$ as Z while ignoring the information in $e^*$. However, this is infeasible. To see why, assume $\widehat{G}_+^* \supseteq G_+$ is a "correct" ARG; then the naive likelihood ratio involves the terms $\{Y_{\cdot, j}\}_{1 \leq j \leq p}$ in place of the unobserved errors. Since $\{Y_{\cdot, j}\}_{1 \leq j \leq p}$ given Z are deterministic and do not vanish under either $H_0$ or $H_a$, the conditional distribution of this naive likelihood ratio given Z does not approximate the null distribution of Lr, in contrast to the DP likelihood ratio in (14).
Algorithm 3 summarizes the DP method for hypothesis testing.

Remark 6
For acceleration, we parallelize Step 2 in Algorithm 3. Additionally, we use the estimate θ as a warm-start initialization for the DP estimates, effectively reducing the computing time.
Remark 7 (Connection with bootstrap) One may consider a parametric or nonparametric bootstrap for Lr. The parametric bootstrap requires a good initial estimate of (U, W). Yet, it is rather challenging to correct the bias of such an estimate because of the acyclicity constraint. By comparison, DP does not rely on such an estimate. On the other hand, the nonparametric bootstrap resamples the original data with replacement. A bootstrap sample contains only about 63% (that is, roughly $1 - e^{-1}$) of the distinct observations in the original data for model selection and fitting, leading to deteriorating performance (Kleiner et al., 2012), especially in small samples. As a result, the nonparametric bootstrap may not approximate the distribution of Lr well, whereas DP provides a better approximation by taking advantage of the full sample.

Theory
This section provides the theoretical justification for the proposed methods.

Convergence and consistency of structure learning
First, we introduce some technical assumptions to derive statistical and computational properties of Algorithms 1 and 2. Let ζ be a generic vector and $\zeta_A$ be the subvector of ζ with coordinates in A. Let $\kappa_j^\circ = |\{l: V_{lj} \neq 0\}|$ denote the sparsity of $V_{\cdot j}$ and $\kappa_{\max}^\circ = \max_{1 \leq j \leq p} \kappa_j^\circ$.

Assumption 2 For constants $c_1, c_2 > 0$, the design matrix X satisfies a restricted eigenvalue condition with bounds governed by $c_1$ and $c_2$.

Assumption 3 $\min_{(l, j): V_{lj} \neq 0} |V_{lj}|^2 \gtrsim \log(q)/n + \log(n)/n$.

Assumption 2 is a common condition for proving the convergence rate of the Lasso (Bickel et al., 2009; Zhang et al., 2014). As a replacement of Assumption 1A, it is satisfied with probability tending to one for isotropic subgaussian or bounded X (Rudelson and Zhou, 2013). Assumption 3, as an alternative to Assumption 1B, specifies the minimal signal strength over candidate interventions. Such a signal strength requirement is used for establishing high-dimensional variable selection consistency (Fan et al., 2014; Loh and Wainwright, 2017; Zhao et al., 2018). Moreover, Assumption 3 can be further relaxed to a less intuitive condition, Assumption 5; see Appendix A.3 for details.
Theorem 4 (Finite termination and consistency) Suppose Assumptions 1-3 are met with constants $c_1 < 6c_2$, and the machine precision $\mathrm{tol} \ll 1/n$ is negligible. For $1 \leq j \leq p$, suppose the tuning parameters $(\kappa_j, \tau_j)$ of Algorithm 1 are suitably chosen. Then for any $\gamma_j$ such that

$$\tau_j^{-1} \big( 32 c_2^2\, \Omega_{jj}^{-1}\, n^{-1} (\log(q) + \log(n)) \big)^{1/2} \leq \gamma_j \leq c_1/6,$$

almost surely Algorithm 1 yields a global solution $\widehat{V}_{\cdot j}$ of (7) in at most $\lceil 1 + \log(\kappa_{\max}^\circ)/\log(4) \rceil$ DC iterations when n is sufficiently large, where ⌈·⌉ is the ceiling function. Moreover, almost surely Algorithm 2 recovers $\mathcal{E}_+$ and $\mathcal{I}_+$ when n is sufficiently large.
In view of Theorem 4 (proved in Appendix B.4), it suffices to specify the maximum number of DC iterations as $\lceil 1 + \log(\kappa_{\max}^\circ)/\log(4) \rceil$. Then the time complexity of Algorithm 1 is $O(\log\kappa_{\max}^\circ \times (q^3 + nq^2))$ for each j, as each weighted Lasso subproblem can be solved in $O(q^3 + nq^2)$ operations (Efron et al., 2004). Since Algorithm 2 does not involve heavy computation, the overall time complexity for estimating the ARG (Algorithms 1-2) is $O(p \times \log\kappa_{\max}^\circ \times (q^3 + nq^2))$. Finally, the peeling method does not apply to observational data (W = 0); in a sense, interventions are essential.
Theorem 4 establishes the consistent reconstruction by the peeling algorithm for the ARG. Yet, it does not provide any uncertainty measure for the presence of some directed edges in the true DAG. In what follows, we will develop an asymptotic theory for hypothesis tests concerning directed edges of interest.

Inferential theory
Assumption 4 is a hypothesis-specific condition restricting the underlying dimension of the testing problem. Usually, $|\mathrm{PA}_S(j)| \asymp |\mathrm{AN}_G(j)| \asymp |\mathrm{IN}_S(j)| \asymp \kappa_j^\circ \ll p$ for $1 \leq j \leq p$, which relaxes the condition $n \gg p \log(p) |D|$ required for the constrained likelihood ratio test (Li et al., 2020).

Theorem 5 (Empirical p-values)
Suppose Assumptions 1-4 are met and H 0 is regular. Assume the tuning parameters in Algorithm 1 satisfy the requirements in Theorem 4.

A.
For the test of directed edges (2), the empirical DP p-value is asymptotically valid;

B.
For the test of directed pathways (3), the analogous validity holds. By Theorem 5 (proved in Appendix B.5), the DP likelihood ratio test yields a valid p-value for (2) and (3) under appropriate conditions. Note that |D| is permitted to depend on n.

Simulations
This section investigates the operating characteristics of the proposed tests and the peeling algorithm via simulations. In simulations, we consider two setups for generating $U \in \mathbb{R}^{p \times p}$, representing random and hub DAGs, respectively, and set $W = (A^\top, B^\top, 0^\top)^\top \in \mathbb{R}^{q \times p}$, where $A, B \in \mathbb{R}^{p \times p}$ and $0 \in \mathbb{R}^{(q - 2p) \times p}$.
• Setup (B). Set A jj = A j, j + 1 = B jj = B j, j + 1 = 1; j = 1, …, p − 1, A pp = 1, while other entries of A, B are zero. Here, the only valid instrument is X p on Y p , and the other intervention variables either have two targets or are inactive. Importantly, Assumption 1C is not met.

Inference
We compare three tests in terms of empirical type-I errors and powers in simulated examples, namely, the DP likelihood ratio test (DP-LR) in Algorithm 3, the asymptotic likelihood ratio test (LR), and the oracle likelihood ratio test (OLR). Here LR uses Lr, while OLR uses Lr(S, Σ), assuming that S is known in advance. The p-values of LR and OLR are computed via Proposition 6. The implementation details of these tests are given in Appendix C.1.
For the empirical type-I error of a test, we compute the percentage of times rejecting $H_0$ out of 500 simulations when $H_0$ is true. For the empirical power of a test, we report the percentage of times rejecting $H_0$ out of 100 simulations under the alternative hypotheses $H_a$.

• Test of directed edges.
For (2), we examine two different hypotheses. For testing directed edges, as displayed in Figure 2, DP-LR and LR perform well compared with the ideal test OLR in Setup (A), where Assumption 1 is satisfied. In Setup (B), where Assumption 1C is not fulfilled, DP-LR retains control of the type-I error, whereas LR has an inflated empirical type-I error relative to the nominal level α = 0.05. This discrepancy is attributed to the data perturbation scheme accounting for the uncertainty of identifying S.
However, both suffer a loss of power compared with the oracle test OLR in this setup. This observation suggests that without Assumption 1C, the peeling algorithm tends to yield an estimate $\widehat{G}_+ \supseteq G_+$ that overestimates $G_+$, resulting in a power loss.
For testing directed pathways, as indicated in Figure 3, we observe phenomena similar to those of the directed edge tests. Of note, both LR and DP-LR are capable of controlling the type-I error of directed pathway tests.
In summary, DP-LR has a suitable control of type-I error when there are invalid instruments and Assumption 1C is violated. Concerning the power, DP-LR and LR are comparable in all scenarios and their powers tend to one as the sample size n or the signal strength of tested edges increases. Moreover, DP-LR and LR perform nearly as well as the oracle test OLR when Assumption 1 is satisfied. These empirical findings agree with our theoretical results.

Structure learning
This subsection compares the peeling algorithm with the Two-Stage Penalized Least Squares (2SPLS, Chen et al. (2018)) in terms of the structure learning accuracy. For peeling, we consider Algorithm 2 with an additional step (9) for structure learning of U. For 2SPLS, we use the R package BigSEM.
2SPLS requires all the intervention variables to be target-known instruments, in addition to Assumption 1C. Thus, we consider an additional Setup (C).
For 2SPLS, we assign each active intervention variable to its most correlated primary variable in Setups A-C. In Setup C, this assignment yields a correct identification of valid instruments, meeting all the requirements of 2SPLS.
For each scenario, we compute the structural Hamming distance (SHD) averaged over 100 runs. As shown in Figure 4, the peeling algorithm outperforms 2SPLS, especially when there are invalid instruments and Assumption 1C is violated.
Appendix C.2 contains additional numerical experiments on structure learning, including the results of different sparsity settings, SHD transition curves, and different numbers of interventions.

Comparison of inference and structure learning
This subsection compares the proposed DP testing method with the proposed structure learning method in (9) in terms of inferring the true graph structure. To this end, we consider Setup (A) in Section 5.1 with p = 30, q = 100, and the hypotheses $H_0: U_{1,20} = 0$ versus $H_a: U_{1,20} = 1/\sqrt{n}$.
For DP inference, we use α = 0.05 and choose the tuning parameters by BIC as in the previous experiments; see Appendix C.1 for details. For structure learning, we reject the null hypothesis when the estimate $\widehat{U}_{1,20} \neq 0$.
As displayed in Figure 5, when the null hypothesis $H_0$ is true, the DP testing method controls the type-I error very close to the nominal level of 0.05, whereas the type-I error of structure learning varies greatly depending on the tuning parameter selection. Under the alternative hypothesis $H_a$, DP inference enjoys higher statistical power than structure learning when n ≥ 200. Interestingly, the power of structure learning diminishes as n increases. This observation agrees with our theoretical results in Theorem 10, suggesting that consistent reconstruction requires the smallest size of the nonzero coefficients to be of order $\gtrsim \sqrt{\log(n)/n}$ with the tuning parameter τ of the same order (fixing p, q). In this case, the edge $U_{1,20}$ is of order $1/\sqrt{n}$, which is less likely to be reconstructed as n increases. In contrast, Proposition 7 indicates that a DP test has non-vanishing power when the hypothesized edges are of order $1/\sqrt{n}$. Figure 5 demonstrates some important distinctions between inference and structure learning. When different tuning parameters are used, the structure learning results correspond to different points on an ROC curve. Although structure learning is asymptotically consistent when optimal tuning parameters are used, it lacks an uncertainty measure of graph structure identification. As a result, it is nontrivial for structure learning methods to trade off the false discovery rate against detection power in practice. This makes the interpretation of such results difficult, especially when they depend heavily on hyperparameters, as in Figure 5. By comparison, DP inference aims to maximize statistical power while controlling the type-I error at a given level, offering a clear interpretation of its result. This observation agrees with the discussions in the literature on variable selection and inference (Wasserman and Roeder, 2009; Meinshausen and Bühlmann, 2010; Lockhart et al., 2014; Candes et al., 2018) and justifies the demand for inferential tools for directed graphical models.

ADNI data analysis
This section applies the proposed tests to analyze an Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. In particular, we infer gene pathways related to Alzheimer's Disease (AD) to highlight some gene-gene interactions differentiating patients with AD or cognitive impairments from healthy individuals.
The raw data are available in the ADNI database (https://adni.loni.usc.edu), including gene expression, whole-genome sequencing, and phenotypic data. After cleaning and merging, we have a sample of 712 subjects. From the KEGG database (Kanehisa and Goto, 2000), we extract the AD reference pathway (hsa05010, https://www.genome.jp/pathway/hsa05010), which includes 146 genes in the ADNI data.
For data analysis, we first regress the gene expression levels on five covariates (gender, handedness, education level, age, and intracranial volume) and then use the residuals as gene expressions in the subsequent analysis. Next, we extract the genes with at least one SNP at a marginal significance level below $10^{-3}$, yielding p = 63 genes as primary variables. For these genes, we further extract their two marginally most correlated SNPs, resulting in q = 63 × 2 = 126 SNPs as unspecified intervention variables in the subsequent data analysis. All gene expression levels are normalized. In the literature, the genes APP, CASP3, and PSEN1 are well known to be associated with AD and are reported to play different roles in AD patients and healthy subjects (Julia and Goate, 2017; Su et al., 2001; Kelleher III and Shen, 2017). For this dataset, we conduct hypothesis testing on edges and pathways related to genes APP, CASP3, and PSEN1 in the KEGG AD reference (hsa05010) to evaluate the proposed DP inference by checking whether it can discover the differences reported in the biomedical literature. First, we consider testing $H_0: U_{kj} = 0$ versus $H_a: U_{kj} \neq 0$ for each edge (k, j), as shown in Figure 6 (a) and (b). Moreover, we consider two hypothesis tests of pathways, $H_0: U_{kj} = 0$ for some $(k, j) \in P_\ell$ versus $H_a: U_{kj} \neq 0$ for all $(k, j) \in P_\ell$; ℓ = 1, 2, where the two pathways are $P_1$: PSEN1 → CAPN1 → CDK5R1 and $P_2$: PSEN1 → CAPN2 → CDK5R1. As shown in Figure 7, the pathway test (3) supports the presence of the pathway PSEN1 → CAPN1 → CDK5R1 in the AD-MCI group with a p-value of 0.044 but not in the CN group with a p-value of 0.33. The pathway PSEN1 → CAPN2 → CDK5R1 appears insignificant at α = 0.05 for both groups. Also noted is that some of our discoveries agree with the literature according to the AlzGene database (alzgene.org) and the AlzNet database (https://mips.helmholtz-muenchen.de/AlzNet-DB). Specifically, GSK3B differentiates AD patients from normal subjects; as shown in Figure 6, our result indicates the presence of the connection APP → GSK3B for the AD-MCI group, but not for the CN group, the former of which is confirmed by Figure 1 of Kremer et al. (2011). The connection APP → APBB1 also differs in the AD-MCI and CN groups, which appears consistent with Figure 3 of Bu (2009). Moreover, the connection CAPN1 → CDK5R1, in the pathway PSEN1 → CAPN1 → CDK5R1 discovered in the AD-MCI group, is found in the AlzNet database (interaction-ID 24614, https://mips.helmholtz-muenchen.de/AlzNet-DB/entry/show/1870). Finally, as suggested by Figure 8, the normality assumption in (1) is adequate for both groups.
By comparison, as shown in Figure 6 (c) and (d), gene APP in the networks reconstructed by 2SPLS (Chen et al., 2018) is not connected with other genes, indicating no regulatory relation of APP with other genes in the AD-MCI and CN groups. However, as a well-known gene associated with AD, APP is reported to play different roles in controlling the expression of other genes in AD patients and healthy people (Matsui et al., 2007; Julia and Goate, 2017). Our results in Figure 6 (a) and (b) are consistent with these studies: the connections of APP with other genes differ between our estimated networks for the AD-MCI and CN groups.
In summary, our findings seem to agree with those in the literature (Julia and Goate, 2017; Su et al., 2001; Kelleher III and Shen, 2017), in which the subnetworks of genes APP and CASP3 in Figure 6 and PSEN1 in Figure 7 differentiate the AD-MCI group from the CN group. Furthermore, the pathway PSEN1 → CAPN1 → CDK5R1 in Figure 7 seems to differentiate these groups, which, however, requires validation in biological experiments.

Summary
This article proposes structure learning and inference methods for a Gaussian DAG with interventions, where the targets and strengths of interventions are unknown. A likelihood ratio test is derived based on an estimated ARG formed by ancestral relations and candidate interventional relations. This test accounts for the statistical uncertainty of the construction of the ARG based on a novel data perturbation scheme. Moreover, we develop a peeling algorithm for the ARG construction. The peeling algorithm allows scalable computing and yields a consistent estimator. The numerical studies justify our theory and demonstrate the utility of our methods.
The proposed methods can be extended to many practical situations beyond biological applications with independent and identically distributed data. An instance is to infer directed relations between multiple autoregressive time series (Pamfil et al., 2020), where the lagged variables and covariates can serve as interventions for each time series.
The current work has two limitations. First, the inferential theory requires (asymptotically) correct recovery of the local DAG structures (Remark 8) to produce valid p-values, similar to Shi et al. (2019) and Zhu et al. (2020). As illustrated in numerical studies, the graph structures are reasonably recovered when n is moderately large, and the DP scheme empirically alleviates the issue of inference after the ARG reconstruction. However, whether valid p-values can be obtained without the exact reconstruction of nuisance graph structures remains unclear in theory. Second, the proposed methods do not treat hidden confounding, which often arises in practice and can bias the results of both inference and learning. One future research direction is to extend the framework of unspecified interventions to allow unmeasured confounders.
A.1 Examples of non-identifiability

As suggested by Proposition 1, Assumption 1 (1A-1C) suffices for identification of every parameter value in the parameter space. Next, we show by examples that if Assumption 1B or 1C is violated, then model (1) is no longer identifiable. To proceed, we rewrite (1) as

$$Y = V^\top X + \tilde{\varepsilon}, \qquad \tilde{\varepsilon} \sim N(0, \Omega^{-1}),$$

where $\Omega = (I - U)\Sigma^{-1}(I - U^\top)$ is a precision matrix and $V = W(I - U)^{-1}$.

A.2 Illustration of Algorithm 2
We now illustrate Algorithm 2 by Example 3.
- X_4 is identified as an instrument of the leaf node Y_5 (X_4 → Y_5) because $\widehat{V}_{45} \neq 0$ is the only nonzero in row 4, a row with the smallest (positive) row $\ell_0$-norm. Then {Y_4, Y_5} are removed.
- X_5 is identified as an instrument of a leaf node Y_3 (X_5 → Y_3) in $G_{\mathrm{work}}$, given that $\widehat{V}_{53} \neq 0$ is the only nonzero element in the row with the smallest (positive) row $\ell_0$-norm of the submatrix for Y_1, Y_2, Y_3. Since Y_3 is a leaf in $G_{\mathrm{work}}$, Y_4 has been removed, X_5 is the only instrument on Y_3, and $\widehat{V}_{54} \neq 0$, we have (3, 4) ∈ $\widehat{\mathcal{E}}_+$ by Proposition 3. Then {Y_3} is removed.
- X_3 is identified as an instrument of a leaf node Y_2 (X_3 → Y_2) similarly in $G_{\mathrm{work}}$, given that $\widehat{V}_{32} \neq 0$ is the only nonzero element in its row of the submatrix. Since Y_2 is a leaf in $G_{\mathrm{work}}$, Y_3 has been removed, X_3 is the only instrument on Y_2, and $\widehat{V}_{33} \neq 0$, we have (2, 3) ∈ $\widehat{\mathcal{E}}_+$. Then {Y_2} is removed.
- X_1 is an instrument of Y_1 (X_1 → Y_1). Since Y_1 is a leaf node in $G_{\mathrm{work}}$, Y_2 has been removed, X_1 is the only instrument on Y_1, and $\widehat{V}_{12} \neq 0$, we have (1, 2) ∈ $\widehat{\mathcal{E}}_+$. Then {Y_1} is removed, and the peeling process terminates.
Finally, Steps 9 and 10 identify $\widehat{\mathcal{E}}_+$ and $\widehat{\mathcal{I}}_+$, which are equal to $\mathcal{E}_+$ and $\mathcal{I}_+$, respectively. In Example 3, $\{(l, j): \widehat{V}_{lj} \neq 0\} \neq \{(l, j): V_{lj} \neq 0\}$, suggesting that the selection consistency of $\widehat{V}$ is unnecessary for Algorithm 2 to correctly reconstruct the ARG $G_+$; see also Section A.3 for the theoretical justification.

A.3 Relaxation of Assumption 3
Assumption 3 in Theorem 4 leads to consistent identification of V. Now, we discuss when Algorithm 2 correctly reconstructs $G_+$ without requiring Assumption 3.
Assumption 5 For 1 ≤ j ≤ p, there exists τ j * such that

A.
$\{l: X_l \text{ intervenes on } Y_j \text{ or its unmediated parents}\} \subseteq \{l: |V_{lj}| \geq \tau_j^*\}$.
Assumption 5 requires the effects of intervention variables on $Y_j$ or its unmediated parents to exceed a certain signal strength $\tau_j^*$, while imposing no restrictions on the other intervention variables, for 1 ≤ j ≤ p. These signals enable us to reconstruct $G_+$. Assumption 3 implies Assumption 5 with $\tau_j^* = \min_{l: V_{lj} \neq 0} |V_{lj}|$, so Assumption 5 is weaker.
The proof of Theorem 9 is given in Appendix B.9.

A.4 Comparison of strong faithfulness and Assumption 3 (or 5)
In the literature, a faithfulness condition is usually assumed for identifiability up to Markov equivalence classes (Spirtes et al., 2000). For discussion, we formally introduce the concepts of faithfulness and strong faithfulness.
Consider a DAG G with node variables $(Z_1, \ldots, Z_{p+q})^\top$. Nodes $Z_i$ and $Z_j$ are adjacent if $Z_i \to Z_j$ or $Z_j \to Z_i$. A path (undirected) between $Z_i$ and $Z_j$ in G is a sequence of distinct nodes in which consecutive nodes are adjacent. A node on a path is a collider if both of its adjacent edges on the path point into it; otherwise it is a noncollider. Let $A \subseteq \{1, \ldots, p + q\}$, where A does not contain i and j. Then $Z_A$ blocks a path $(Z_i, \ldots, Z_j)$ if at least one of the following holds: (i) the path contains a noncollider that is in $Z_A$, or (ii) the path contains a collider that is not in $Z_A$ and has no descendant in $Z_A$. A node $Z_i$ is d-separated from $Z_j$ given $Z_A$ if $Z_A$ blocks every path between $Z_i$ and $Z_j$; $i \neq j$ (Pearl, 2009).
According to Uhler et al. (2013), a multivariate Gaussian distribution of $(Z_1, \ldots, Z_{p+q})^\top$ is said to be ς-strong faithful to a DAG with node set V = {1, …, p + q} if

$$|\mathrm{Corr}(Z_i, Z_j \mid Z_A)| > \varsigma \quad \text{whenever } Z_i \text{ and } Z_j \text{ are not d-separated given } Z_A, \quad 1 \leq i \neq j \leq p + q, \tag{24}$$

where $\varsigma \in [0, 1)$ and Corr denotes the correlation. When ς = 0, (24) is equivalent to faithfulness. For consistent structure learning (up to Markov equivalence classes), it often requires $\varsigma \gtrsim \sqrt{s_0 \log(p + q)/n}$, where $s_0$ is a sparsity measure; see Uhler et al. (2013) for a survey. For a pair (i, j), the number of possible conditioning sets A is $2^{p+q-2}$. If $Z_i \to Z_j$, then $\mathrm{Corr}(Z_i, Z_j \mid Z_A) \neq 0$ for any A. Therefore, for this (i, j) pair alone, (24) could impose exponentially many conditions. Indeed, (24) is very restrictive in high-dimensional situations (Uhler et al., 2013).
By comparison, Algorithm 2 yields consistent structure learning based on Assumption 3 or 5 instead of strong faithfulness. In some sense, Assumption 3 or 5 requires a sufficient signal strength, analogous to the condition for consistent feature selection (Shen et al., 2012). This assumption may be thought of as an alternative to strong faithfulness. As illustrated in Example 4, Assumption 3 or 5 is less stringent than strong faithfulness.
Importantly, conditions (i)-(vi) are all required in (25), suggesting that strong faithfulness is more stringent than Assumption 3.

A.5 Irregular hypothesis
Assume, without loss of generality, that $\widehat{D} = D$ and $\widehat{G}_+ = G_+$ are correctly reconstructed in the following discussion.
• For testing directed edges (2), suppose $H_0$ is irregular, namely, D ∪ ℰ contains a directed cycle. This implies that a directed cycle exists in $D \cup \mathcal{E}_+$. In this situation, we decompose $H_0$ into sub-hypotheses $H_0^{(1)}, \ldots, H_0^{(\nu)}$, each of which is regular. Then testing $H_0$ is equivalent to multiple testing for $H_0^{(1)}, \ldots, H_0^{(\nu)}$.
• For testing directed pathways (3), if $H_0$ is irregular, then $D \cup \mathcal{E}_+$ has a directed cycle. The p-value is defined to be one in this situation since no evidence supports the presence of the pathway.

A.6 Theoretical results on structure learning
The regression (9) can be solved by Algorithm 1 with the input $Y_{\cdot j}$ as the response variable and the input $(Y_{\cdot, \mathrm{AN}_{\widehat{G}_+}(j)}, X_{\cdot, \mathrm{IN}_{\widehat{G}_+}(j)})$ as the covariates, for $1 \leq j \leq p$. The tuning parameters for solving (9) by Algorithm 1 are denoted by $\{(\kappa_j', \tau_j')\}_{1 \leq j \leq p}$.
Denote by G(θ) and $G(\tilde{\theta})$ the DAGs corresponding to θ and $\tilde{\theta}$, respectively. First, consider G(θ). Without loss of generality, assume $Y_1$ is a leaf node in G(θ). By Assumption 1C, there exists an instrumental intervention with respect to G(θ), say $X_1$. Then

$$\mathrm{Cov}(Y_j, X_1 \mid X_{\{2, \ldots, q\}}) = 0, \quad j = 2, \ldots, p; \qquad \mathrm{Cov}(Y_1, X_1 \mid Y_A, X_{\{2, \ldots, q\}}) \neq 0 \quad \text{for any } A \subseteq \{2, \ldots, p\}. \tag{27}$$

By the local Markov property (Spirtes et al., 2000), (27) implies that $X_1 \to Y_1$ in $G(\tilde{\theta})$. Suppose $Y_1$ is not a leaf node in $G(\tilde{\theta})$. Without loss of generality, assume that $Y_1$ is an unmediated parent of $Y_2$. Then $\mathrm{Cov}(Y_2, X_1 \mid X_{\{2, \ldots, q\}}) = 0$ but $X_1 \to Y_1$ and $Y_1$ is an unmediated parent of $Y_2$, which contradicts Assumption 1B. This implies that if $Y_1$ is a leaf node in G(θ), then it must also be a leaf node in $G(\tilde{\theta})$. In both G(θ) and $G(\tilde{\theta})$, the parents and interventions of $Y_1$ can be identified from the conditional covariances. Consequently, $Y_1$ has the same parents and interventions in G(θ) and $G(\tilde{\theta})$.
The foregoing argument applies to the other nodes sequentially. First, we remove $Y_1$ together with any directed edges into $Y_1$, which does not alter the joint distribution of $(Y_{\{2, \ldots, p\}}, X)$ or the sub-DAG on nodes $Y_2, \ldots, Y_p$. By induction, we remove the leaves in G(θ) until the graph is empty, leading to $G(\theta) = G(\tilde{\theta})$. Finally, $\theta = \tilde{\theta}$ because the two parameters have the same locations of nonzero elements, and these model parameters (or regression coefficients) are uniquely determined under Assumption 1A (Shojaie and Michailidis, 2010). This completes the proof.

PROOF OF (A)
Note that the maximal length of a path in a DAG of p nodes is at most p − 1. Then it can be verified that U is nilpotent in that $U^p = 0$. An application of the matrix series expansion yields $(I - U)^{-1} = I + U + \cdots + U^{p-1}$. Using the fact that $V = W(I - U)^{-1}$ from (6),

$$V_{lj} = \sum_{r=0}^{p-1} (W U^r)_{lj} = \sum_{r=0}^{p-1} \sum_{k=1}^p W_{lk} (U^r)_{kj},$$

where $U_{kj}$ is the (k, j)th entry of U. If $V_{lj} \neq 0$, then there exists k such that $W_{lk} \neq 0$ and $(U^r)_{kj} \neq 0$ for some $0 \leq r \leq p - 1$. If r = 0, then we must have k = j, and $X_l \to Y_j$. If r > 0, then $X_l \to Y_k$ and $Y_k$ is an ancestor of $Y_j$.

PROOF OF (B)
First, for any leaf node variable $Y_j$, by Assumption 1, there exists an instrument $X_l \to Y_j$. If $V_{lj'} \neq 0$ for some $j' \neq j$, then $Y_j$ must be an ancestor of $Y_{j'}$, which contradicts the fact that $Y_j$ is a leaf node variable.
Conversely, suppose that $V_{lj} \neq 0$ and $V_{lj'} = 0$ for $j' \neq j$. If $Y_j$ is not a leaf node variable, then there exists a variable $Y_{j'}$ such that $Y_j$ is an unmediated parent of $Y_{j'}$, that is, $U_{jj'} \neq 0$ and $(U^r)_{jj'} = 0$ for r > 1. Then $V_{lj'} = W_{lj} U_{jj'} \neq 0$, a contradiction.

B.3 Proof of Proposition 3
Suppose $Y_k$ is an unmediated parent of $Y_j$. Let $X_l$ be an instrument of $Y_k$ in $G_{\mathcal{L}^c}$. Then there are two cases: (1) $X_l$ intervenes on $Y_k$ but not on $Y_j$, namely $X_l \to Y_k$ but $X_l \nrightarrow Y_j$; (2) $X_l$ intervenes on $Y_k$ and $Y_j$ simultaneously, namely $X_l \to Y_k$ and $X_l \to Y_j$. For (1), $V_{lj} = W_{lk} U_{kj} \neq 0$. For (2), Assumption 1B implies that $V_{lj} \neq 0$. This holds for every instrument of $Y_k$ in $G_{\mathcal{L}^c}$, and the desired result follows.
Conversely, suppose for each instrument X l of Y k in G ℒ c , we have V lj ≠ 0. Let X l′ be an instrument of Y k in G, which is also an instrument in G ℒ c . Then 0 ≠ V l′j = W l′k U kj , which implies U kj ≠ 0. This completes the proof.

B.4 Proof of Theorem 4
The proof proceeds in two steps: (A) we show that $\{l: \widehat{V}_{lj} \neq 0\} = \{l: V_{lj} \neq 0\}$ almost surely for 1 ≤ j ≤ p; and (B) we show that $\widehat{G}_+ = G_+$ whenever $\widehat{V}$ satisfies the property in (A).

PROOF OF (A)
Let $A_j^\circ = \{l: V_{lj} \neq 0\}$ and let $\xi_j = Y_{\cdot j} - X V_{\cdot j}^\circ$ be the residual of the oracle least squares estimate $V_{\cdot j}^\circ$, whose support satisfies $\{l: V_{lj}^\circ \neq 0\} = A_j^\circ$. Compare the objective values of $V_{\cdot j}^{[t]}$ and $V_{\cdot j}^\circ$ in (8), where $V_{\cdot j}^{[t]}$ is defined in (8). Plugging $\xi_j = Y_{\cdot j} - X V_{\cdot j}^\circ$ into the inequality and rearranging it, we obtain that $\|X(V_{\cdot j}^{[t]} - V_{\cdot j}^\circ)\|_2^2 / n$ is no greater than a bound in which $X_{A_j^*}^\top \xi_j / n = 0$ has been used. Then (28) is bounded via Δ, where Δ denotes the symmetric difference of two sets. Note that $\|X(V_{\cdot j}^{[t]} - V_{\cdot j}^\circ)\|_2^2 / n \geq 0$.
Hence, ℒ is the index set of the leaves in $G_{\mathrm{work}}$. By Proposition 3, all (k, j) such that $Y_k$ is an unmediated parent of a peeled variable $Y_j$ are in $\mathcal{E}_+$ and can be identified by Assumption 1B.
In model (1), after removing $Y_{\mathcal{L}}$, the local Markov property of the remaining variables in $G_{\mathrm{work}}$ stays intact. We then repeat the procedure until all primary variables are removed.
As a result, Algorithm 2 correctly identifies a subset of $\mathcal{E}_+$ containing all edges from unmediated parents, which suffices to recover $\mathcal{E}_+$. Consequently, $\mathcal{E}_+$ can be reconstructed by Step 9 and $\mathcal{I}_+$ can be recovered by Step 10. This completes the proof of (B).

Proof of Lemma 11
Let Z = (Y, X) as in (13). Given the data submatrix $Z_{A_j}$,

$$\frac{e_j^\top (P_{A_j} - P_{B_j}) e_j}{\sigma_j^2} \,\Big|\, Z_{A_j} \sim \chi^2_{|A_j| - |B_j|}, \qquad \frac{e_j^\top (I - P_{A_j}) e_j}{\sigma_j^2} \,\Big|\, Z_{A_j} \sim \chi^2_{n - |A_j|},$$

and they are independent, because $P_{A_j} - P_{B_j}$ and $I - P_{A_j}$ are orthogonal projection matrices, $e_j \mid Z_{A_j} \sim N(0, \sigma_j^2 I_n)$, and $(P_{A_j} - P_{B_j})(I - P_{A_j}) = 0$. Then, for any real number t, the characteristic function $E \exp(\iota t T_j / (|A_j| - |B_j|))$ is (ι is the imaginary unit)

$$E\big( E(\exp(\iota t T_j / (|A_j| - |B_j|)) \mid Z_{A_j}) \big) = E\, \psi_{|A_j| - |B_j|,\, n - |A_j|}(t) = \psi_{|A_j| - |B_j|,\, n - |A_j|}(t),$$

where $\psi_{|A_j| - |B_j|,\, n - |A_j|}$ is the characteristic function of the F-distribution with degrees of freedom $(|A_j| - |B_j|,\, n - |A_j|)$. Hence, $T_j / (|A_j| - |B_j|) \sim F_{|A_j| - |B_j|,\, n - |A_j|}$. Similarly, we also have $T_j^* / (|A_j| - |B_j|) \sim F_{|A_j| - |B_j|,\, n - |A_j|}$; j = 1, …, p.
Next, we prove the independence of the components of T and $T^*$ via a peeling argument. Let $t = (t_1, \ldots, t_p)$ and let $Y_j$ be a leaf node of the graph G. Then $T_{-j} \mid Y_{-j}, X$ is deterministic, where $T_{-j}$ is the subvector of T with the jth component removed. The characteristic function of $t^\top T$ is

$$E \exp(\iota t^\top T) = E\big( E(\exp(\iota t_j T_j) \mid Y_{-j}, X)\, \exp(\iota t_{-j}^\top T_{-j}) \big) = \psi_{|A_j| - |B_j|,\, n - |A_j|}(t_j)\, E \exp(\iota t_{-j}^\top T_{-j}),$$

where $\psi_{|A_j| - |B_j|,\, n - |A_j|}$ is the characteristic function of the F-distribution with degrees of freedom $(|A_j| - |B_j|,\, n - |A_j|)$. Next, let $Y_{j'}$ be a leaf node of the graph G′ and apply the law of iterated expectation again, where G′ is the subgraph of G without node $Y_j$. Repeating this procedure, after p steps we obtain

$$E \exp(\iota t^\top T) = \prod_{j=1}^p \psi_{|A_j| - |B_j|,\, n - |A_j|}(t_j),$$

which implies that $(T_1, \ldots, T_p)$ are independent.
Similarly, T* also has independent components given Z and has the same distribution as T. This completes the proof.
2SPLS (Chen et al., 2018): • We use the R package BigSEM for 2SPLS. In our experiments, the λ sequence of 2SPLS (Chen et al., 2018) is set in the same way as the γ sequence of the proposed methods. We use the default settings for the other parameters.

C.2 Additional simulations
This section supplements Section 5.2 by including additional numerical experiments.

C.2.1 RANDOM GRAPHS WITH DIFFERENT SPARSITY
We examine the proposed method for structure learning of DAGs with different sparsity levels. For U, the upper off-diagonal entries $U_{kj}$; k < j, are sampled independently from {0, 1} according to Bernoulli(s/p); s = 1, 2, 3, 4, while the other entries are zero. The rest of the settings remain the same as in Section 5.2. Figure 11 displays the results. As expected, the performance improves when the sample size n increases and the DAG becomes sparser (controlled by s).
Figure 11: SHDs for the reconstructed DAG by the peeling algorithm.

C.2.2 SHD TRANSITION CURVES OF STRUCTURE LEARNING
We consider different sample sizes n = 50, 100, 150, 200 to further examine how the proposed method depends on n; the results are displayed in Figure 12.
Figure 12: SHDs for the reconstructed DAG by the peeling algorithm. The experiment settings are the same as the ones in Section 5.2.

C.2.3 STRUCTURE LEARNING WITH DIFFERENT NUMBERS OF INTERVENTIONS
Finally, we investigate how the number of interventions q affects the learning outcomes. Figure 13 displays the results for q = 500, 1000, 1500. It suggests that the proposed method performs reasonably well at a moderate sample size when many unknown interventions are present.
Figure 13: SHDs for the reconstructed DAG by the peeling algorithm. The experiment settings are the same as the ones in Section 5.2.

Figure 1:
A DAG G of five primary variables $Y_1, \ldots, Y_5$ and five intervention variables $X_1, \ldots, X_5$, where directed edges are represented by solid arrows while dependencies among X are not displayed.

Figure 5:
Rejection rates for $H_0: U_{1,20} = 0$ versus $H_a: U_{1,20} = 1/\sqrt{n}$ by DP inference and structure learning. For structure learning, $H_0$ is rejected if $\widehat{U}_{1,20} \neq 0$. The red lines indicate the results of DP inference using the significance level α = 0.05. The other colored lines display the results of structure learning using different sparsity parameter values κ = 1, 2, 3, 4, 5. The simulation is repeated 500 times, and κ = 2 is chosen by BIC in over 90% of cases.